A Polynomial Time Matching Algorithm of Structured Ordered Tree Patterns for Data Mining from Semistructured Data

Authors

  • Yusuke Suzuki
  • Kohtaro Inomae
  • Takayoshi Shoudai
  • Tetsuhiro Miyahara
  • Tomoyuki Uchida
Abstract

Tree-structured data such as HTML/XML files are represented by rooted trees with ordered children and edge labels. Knowledge representations for tree-structured data are important for discovering the interesting features that such data have. In this paper, as a representation of structural features, we propose a structured ordered tree pattern, called a term tree, which is a rooted tree pattern consisting of ordered children and structured variables. A variable in a term tree can be substituted by an arbitrary tree. Deciding whether given tree-structured data has a particular structural feature is a core problem in data mining of large tree-structured data. We consider the problem of deciding whether a term tree t matches a tree T, that is, whether T is obtained from t by substituting some trees for the variables in t. This problem is called the membership problem for t and T. Given a term tree t and a tree T, we present an O(nN) time algorithm for solving the membership problem for t and T, where n and N are the numbers of vertices in t and T, respectively. We also report some experiments on applying our matching algorithm to a collection of real Web documents.
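To make the membership problem concrete, the following is a minimal Python sketch under a deliberately simplified model. It is not the O(nN) algorithm of the paper. The assumptions are ours: a variable appears only as a leaf of the pattern, occupies exactly one child position, and may be replaced by any single edge-labeled subtree. The names `Node`, `is_variable`, and `matches` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """A vertex of an ordered tree; `label` is the label of the edge from its parent."""
    label: Optional[str] = None            # None for the root (no incoming edge)
    children: List["Node"] = field(default_factory=list)
    is_variable: bool = False              # hypothetical flag, used only in patterns


def matches(pattern: Node, tree: Node) -> bool:
    """Return True if `tree` can be obtained from `pattern` by replacing each
    variable leaf with some subtree (simplified model, not the paper's)."""
    if pattern.is_variable:
        return True                        # a variable accepts any single subtree
    if pattern.label != tree.label:
        return False                       # edge labels must agree
    if len(pattern.children) != len(tree.children):
        return False                       # ordered children are paired one to one
    return all(matches(p, c) for p, c in zip(pattern.children, tree.children))


# Pattern: root with a "title" edge followed by one variable slot.
pattern = Node(children=[Node("title"), Node(is_variable=True)])

# Document tree: root with a "title" edge and then a "body" subtree.
doc = Node(children=[Node("title"), Node("body", [Node("p"), Node("p")])])

# Same edge labels in the opposite order; ordered matching rejects this tree.
swapped = Node(children=[Node("body"), Node("title")])

print(matches(pattern, doc))      # True
print(matches(pattern, swapped))  # False
```

In this restricted model each pattern vertex is visited at most once, so the check is linear in the size of the pattern. The structured variables described in the abstract are more general than the leaf slots assumed here, which is why the general membership problem calls for the O(nN) algorithm.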

Similar resources

Online Algorithms for Mining Semi-structured Data Stream

In this paper, we study an online data mining problem from streams of semi-structured data such as XML data. Modeling semi-structured data and patterns as labeled ordered trees, we present an online algorithm StreamT that receives fragments of an unseen possibly infinite semistructured data in the document order through a data stream, and can return the current set of frequent patterns immediat...

An Effective Grammar-Based Compression Algorithm for Tree Structured Data

Many semistructured data such as HTML/XML files are represented by rooted trees t such that all children of each internal vertex of t are ordered and all edges of t have labels. Such data is called tree structured data. Analyzing large tree structured data is a time-consuming process in data mining. If we can reduce the size of input data without loss of information, we can speed up such a heav...

Efficient Learning of Semi-structured Data from Queries

This paper studies the learning complexity of classes of structured patterns for HTML/ XML-trees in the query learning framework of Angluin. We present polynomial time learning algorithms for ordered gapped tree patterns, OGT, and ordered gapped forests, OGF, under the into-matching semantics using equivalence queries and subset queries. As a corollary, the learnability with equivalence and mem...

Extraction of Tag Tree Patterns with Contractible Variables from Irregular Semistructured Data

Information Extraction from semistructured data becomes more and more important. In order to extract meaningful or interesting contents from semistructured data, we need to extract common structured patterns from semistructured data. Many semistructured data have irregularities such as missing or erroneous data. A tag tree pattern is an edge labeled tree with ordered children which has tree str...

Discovery of Frequent Tag Tree Patterns in Semistructured Web Documents

Many Web documents such as HTML files and XML files have no rigid structure and are called semistructured data. In general, such semistructured Web documents are represented by rooted trees with ordered children. We propose a new method for discovering frequent tree structured patterns in semistructured Web documents by using a tag tree pattern as a hypothesis. A tag tree pattern is an edge lab...

Publication date: 2002